PACKAGE INSTALLATIONS

My entire attempt to predict the Hotel Occupancy Rate (TPK BPS) was carried out in Jupyter Notebook 6.1.6 on Python 3.8.2 x64 for Windows.

These are the libraries I used in this competition:

  • Pandas, for table-like data processing
  • NumPy, for array-like data processing
  • Scikit-learn, for the machine learning tasks
  • Plotly, for data graphing
  • Matplotlib, for data plotting

The committee only gave us the Daily Hotel Occupancy Rate retrieved online (tpk_harian) as the X variable and the Monthly Hotel Occupancy Rate published by BPS (tpk_bps) as the Y variable. Because this was so little data, I collected additional variables from other sources, as follows:

  • covid_harian_aktif = daily covid active cases, retrieved from KawalCovid19
  • covid_harian = daily new covid cases, retrieved from KawalCovid19
  • covid_total = total covid cases at the end of the month (last day), retrieved from KawalCovid19
  • penerbangan = the number of flight passengers to Bali, retrieved from bali.bps.go.id
  • wisatawan = the number of domestic tourists coming to Bali, retrieved from bps.go.id and from a contact person at Disparda Bali (Dinas Pariwisata Bali)
  • wisatawan_mancanegara = the number of foreign tourists coming to Bali, retrieved from bali.bps.go.id
  • tpk_bps_arima = the data of Monthly Hotel Occupancy Rate published by BPS (tpk_bps) from the previous months (y-1)
  • hari = the number of days in a month
  • mobility = Google mobility index for Indonesia as a whole (not Bali in particular), retrieved from OurWorldInData.Org
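
Two of the variables above, hari and tpk_bps_arima, can be derived from the calendar and from tpk_bps itself. The following is only a minimal sketch: the frame name `monthly` and the tpk_bps values are made up for illustration, and treating tpk_bps_arima as a one-month lag is an assumption about what (y-1) means here.

```python
import pandas as pd

# Hypothetical monthly frame; the tpk_bps values are invented for this sketch.
monthly = pd.DataFrame(
    {"tpk_bps": [59.3, 45.9, 25.4, 3.2]},
    index=pd.period_range("2020-01", periods=4, freq="M"),
)

# hari: number of days in each month (2020 is a leap year, so February has 29).
monthly["hari"] = monthly.index.days_in_month

# tpk_bps_arima (assumed to be a one-month lag): the previous month's rate.
# The first row has no predecessor and becomes NaN.
monthly["tpk_bps_arima"] = monthly["tpk_bps"].shift(1)

print(monthly)
```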

However, after numerous trials, I found that only three independent variables (tpk_online, penerbangan and wisatawan) significantly decrease the best model's RMSE.

I have nevertheless kept the initial pre-processing code for the other, insignificant variables. In model.fit and model.predict, however, only the significant variables are used as the final predictors.
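
As a rough illustration of keeping every pre-processed column while passing only the significant ones to model.fit and model.predict, consider the sketch below. The frames `data_x`, `data_y` and `data_test`, the constant `FINAL_PREDICTORS`, and all values are hypothetical, and SVR is used with default settings only as an example.

```python
import pandas as pd
from sklearn.svm import SVR

# Only these three columns are used as final predictors.
FINAL_PREDICTORS = ["tpk_online", "penerbangan", "wisatawan"]

# Hypothetical merged frame: all pre-processed variables stay available.
data_x = pd.DataFrame({
    "tpk_online": [55.0, 40.0, 12.0, 8.0, 15.0],
    "penerbangan": [500.0, 380.0, 60.0, 20.0, 90.0],
    "wisatawan": [700.0, 520.0, 150.0, 80.0, 210.0],
    "covid_total": [0.0, 10.0, 200.0, 900.0, 1500.0],
    "hari": [31, 29, 31, 30, 31],
})
data_y = pd.Series([59.0, 46.0, 25.0, 3.0, 2.0], name="tpk_bps")
data_test = data_x.tail(2)  # stand-in for the real test months

# Select the significant columns at fit and predict time only.
model = SVR()
model.fit(data_x[FINAL_PREDICTORS], data_y)
prediction = model.predict(data_test[FINAL_PREDICTORS])
print(prediction)
```

Selecting columns at the last step keeps the pre-processing reusable if another variable turns out to matter later.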

As for the models, here are the ones I tried running on the data:

  • Linear Regression
  • Ridge Regression
  • Random Forest Regressor
  • Support Vector Regression (SVR)
  • K-Nearest Neighbor Regressor
  • MLPRegressor (Neural Network Regression)
  • Lasso Regression
  • Decision Tree Regressor

Out of all eight, the ones that most often produce the lowest RMSE values are Random Forest, Lasso and SVR. Random Forest and Lasso work best with more independent variables, but their RMSE values are still higher than that of SVR with fewer independent variables.

READING CSV FILES

DATA DENOISING

HANDLING OUTLIERS ON covid_harian

The following HANDLING OUTLIERS and CALCULATING MONTHLY DATA code for covid_harian applies equally to covid_harian_aktif.
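
The exact outlier rule is not stated in this text, so the sketch below swaps in a common one as an assumption: values outside 1.5 times the interquartile range are treated as outliers and replaced by linear interpolation. The function name `replace_iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd

def replace_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] by interpolation."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values outside the fences become NaN, then are filled linearly.
    cleaned = series.where(series.between(lower, upper))
    return cleaned.interpolate(limit_direction="both")

# Made-up daily counts with one obvious spike.
covid_harian = pd.Series(
    [100.0, 110.0, 105.0, 5000.0, 120.0, 115.0],
    index=pd.date_range("2020-09-01", periods=6, freq="D"),
)
covid_harian_clean = replace_iqr_outliers(covid_harian)
print(covid_harian_clean)
```

Because the function takes any numeric series, the same call would work for covid_harian_aktif.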

CALCULATING MONTHLY DATA on covid_harian

CALCULATE DAILY OCCUPANCY RATE (tpk_harian)

The original data does not include tpk_online. Below is the function that calculates it from the available rooms and total rooms in the data.
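
A minimal sketch of such a calculation follows. It assumes that occupied rooms equal total rooms minus available rooms and that the rate is expressed as a percentage; the column names `room_available` and `room_total` and the function name `add_tpk_online` are hypothetical, not taken from the original data.

```python
import pandas as pd

def add_tpk_online(df: pd.DataFrame,
                   available_col: str = "room_available",
                   total_col: str = "room_total") -> pd.DataFrame:
    """Return a copy of df with a tpk_online column (occupancy in percent)."""
    out = df.copy()
    occupied = out[total_col] - out[available_col]
    # Rows with zero total rooms would yield NaN or inf and need separate handling.
    out["tpk_online"] = 100 * occupied / out[total_col]
    return out

# Made-up rows: 20 of 100 rooms free, fully booked, completely empty.
hotels = pd.DataFrame({"room_available": [20, 0, 50], "room_total": [100, 40, 50]})
hotels = add_tpk_online(hotels)
print(hotels)
```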

OUTLIER DETECTION AND REMOVAL ON tpk_harian

CALCULATE MONTHLY DATA ON tpk_harian

For now the mean is used. If another aggregation method seems more representative, it can be tried here.
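
One way to compare aggregation methods side by side is to group the daily series by calendar month and compute several statistics at once. This is only a sketch with made-up daily values; the name `tpk_bulanan` is hypothetical.

```python
import pandas as pd

# Made-up daily occupancy rates for two months.
tpk_harian = pd.Series(
    [40.0, 50.0, 90.0, 20.0, 30.0],
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-15", "2020-01-31", "2020-02-01", "2020-02-28"]
    ),
)

# Group by calendar month, then compute the mean and an alternative (median).
tpk_bulanan = tpk_harian.groupby(tpk_harian.index.to_period("M")).agg(["mean", "median"])
print(tpk_bulanan)
```

Swapping "median" for another aggregation name, or adding more names to the list, changes the comparison without touching the rest of the pipeline.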

CALCULATE MONTHLY MOBILITY INDEX

MERGE DATA FRAME

EXPORT DATA FRAME

EXPLORATORY DATA ANALYSIS (EDA)

This step aims to show the rough pattern between tpk_bps (Y variable) and each independent variable. A scatter plot library (imported at the beginning of this source code file) is used for this purpose.

PREPARATION FOR VARIABLE CHOICES

The following is to test the correlations between variable Y and each variable X.
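
As a sketch of this test, the Pearson correlation between tpk_bps and every candidate X column can be computed in one call. The frame `data` and its values are invented for illustration, and Pearson is an assumption about which correlation measure is meant.

```python
import pandas as pd

# Made-up monthly values; covid_total is constructed to fall as tpk_bps rises.
data = pd.DataFrame({
    "tpk_bps": [10.0, 20.0, 30.0, 40.0],
    "tpk_online": [12.0, 19.0, 33.0, 41.0],
    "covid_total": [400.0, 300.0, 200.0, 100.0],
})

# Pearson correlation of each X column with Y.
correlations = data.drop(columns="tpk_bps").corrwith(data["tpk_bps"])
print(correlations.sort_values(ascending=False))
```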

CORRELATION BETWEEN X AND Y

COMBINING DATA_X FOR MODEL TRAINING (model.fit)

PREPARE TEST DATA for TPK PREDICTION in January to June 2021

tpk_online TEST

covid_harian TEST

The following code applies equally to the variable covid_harian_aktif. That variable is not significant, so I chose not to take up space by repeating the code.

penerbangan TEST

wisatawan TEST

tpk_bps_arima TEST

hari TEST

covid_total TEST

mobility TEST

MERGE DATA TEST

PREPARING TRUE Y VALUE (TRUE tpk_bps)

BUILDING MODELS WITH SCIKIT LEARN

The following are the models I tried running on the data:

  • Linear Regression
  • Ridge Regression
  • Random Forest Regressor
  • Support Vector Regression (SVR)
  • K-Nearest Neighbor Regressor
  • MLPRegressor (Neural Network Regression)
  • Lasso Regression
  • Decision Tree Regressor
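
The eight models listed above can be fitted and scored in a single loop. The sketch below uses synthetic stand-in data and mostly default hyperparameters, both of which are assumptions; scaling SVR and MLPRegressor inputs is a common practice rather than something this notebook states.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 3 predictors, 30 rows, linear signal.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(30, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 1, size=30)
X_train, X_test, y_train, y_test = X[:24], X[24:], y[:24], y[24:]

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(),
    "Random Forest Regressor": RandomForestRegressor(random_state=0),
    "SVR": make_pipeline(StandardScaler(), SVR()),
    "K-Nearest Neighbor Regressor": KNeighborsRegressor(n_neighbors=3),
    "MLPRegressor": make_pipeline(
        StandardScaler(), MLPRegressor(max_iter=2000, random_state=0)
    ),
    "Lasso Regression": Lasso(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
}

# Fit each model and record its RMSE on the held-out rows.
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, pred)))

# Print the models from lowest to highest RMSE.
for name, score in sorted(rmse.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {score:.3f}")
```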

LINEAR REGRESSION

RIDGE REGRESSION

RANDOM FOREST

SUPPORT VECTOR REGRESSION

k-NEAREST NEIGHBOR REGRESSION

NEURAL NETWORK REGRESSION

LASSO REGRESSION

This model performs best when I use more variables; however, its RMSE is still higher than that of SVR with just 3 independent variables.

DECISION TREE REGRESSION

CHOOSING BEST METHOD

Based on each model's mean absolute error score.
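
Selecting the model with the lowest mean absolute error can be sketched as follows. The true values and the per-model predictions are made up, and the dictionary `predictions` is a hypothetical container for whatever each fitted model returned.

```python
from sklearn.metrics import mean_absolute_error

# Made-up true tpk_bps values and predictions from two models.
y_true = [10.0, 12.0, 15.0]
predictions = {
    "SVR": [11.0, 12.0, 14.0],
    "Lasso": [13.0, 9.0, 18.0],
}

# Score every model, then keep the one with the smallest error.
mae = {name: mean_absolute_error(y_true, pred) for name, pred in predictions.items()}
best_model = min(mae, key=mae.get)
print(best_model, mae[best_model])
```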